Two problems with Japanese LLMs
Many people conflate the two issues around Japanese LLMs, so I drew a diagram. https://gyazo.com/17f5a1e496eeeaac5100fbfc34b21785
Some people argue from perspective 1 that "Japanese LLMs are pointless."
Perspective 2
It is important to have a wide pipe to the "AI that thinks across languages," which will keep growing in the future.
"AI that thinks across languages" is like a newly discovered oil field, and value is gushing forth.
Users of languages with narrower pipes do not enjoy much of the value that comes out of this.
If there is a head start, then the "difference by size" in perspective 1 shrinks.
Currently, GPT-4 performs better when you communicate with it in English than in Japanese.
Talk of building "another, smaller model" is futile, but we do need a "tokenizer + alpha layer suited to Japanese."
One possible solution looks like this:
https://gyazo.com/4c9f57fda2d440f0c32a4d71c1b948d1
Whether this is beneficial remains to be seen.
A "better to try than to do nothing" mentality.
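As a rough illustration of the "narrow pipe" and of why a tokenizer suited to Japanese might matter, the sketch below counts UTF-8 bytes per character. This is only a proxy: it assumes a hypothetical tokenizer that falls back to one token per byte, which is not how GPT-4 actually tokenizes. The function name and sample strings are made up for this example.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in text."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings: this page's title in English and in Japanese.
english = "Two problems with Japanese LLMs"
japanese = "日本語LLMに関する2つの問題"

for label, sample in (("English", english), ("Japanese", japanese)):
    n_chars = len(sample)
    n_bytes = len(sample.encode("utf-8"))
    # Under the byte-fallback assumption, n_bytes is the token count.
    print(f"{label}: {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(sample):.2f} bytes/char")
```

ASCII letters take one byte each, while kana and common kanji take three, so under this assumption the same context window holds fewer Japanese characters. Real subword tokenizers merge frequent byte sequences, so the actual gap depends on how much Japanese text the tokenizer was trained on.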
PS
What if it's not a headache?
nishio If we assume that "the larger the scale of the training data, the higher the value," then the total volume of text written in Japanese cannot match the total volume written in English, and the gap that comes from the number of speakers will not shrink. It is like the Meiji-era argument: "If we don't make English the official language, we're in trouble, aren't we?"
---
This page is auto-translated from /nishio/日本語LLMに関する2つの問題 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.